November 9th, 2016

Common plots

  • Histograms
  • Boxplots
  • Barplots
  • Scatterplots

How would you plot x?

set.seed(20161108)
x <- sample(x = 1:10, size = 1000, replace = TRUE,
    prob = (1:10)/sum(1:10) )
head(x)
## [1]  8  9  8  9  8 10
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   7.000   6.923   9.000  10.000

plot(x)

Boxplot?

boxplot(x)

Histogram?

hist(x)

set.seed(201611082)
plot(seq_along(x), jitter(x, 0.5))
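Barplot?

Since x only takes the integer values 1 to 10, a barplot of the counts is another candidate from the list of common plots. A minimal sketch, re-creating x so it runs on its own; the axis labels are my own choice:

```r
## Re-create x exactly as defined earlier
set.seed(20161108)
x <- sample(x = 1:10, size = 1000, replace = TRUE,
    prob = (1:10)/sum(1:10))

## Count how many times each value appears, then draw one bar per value
counts <- table(x)
barplot(counts, xlab = 'x', ylab = 'Frequency')
```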

Takeaways

  • It's good to explore your data in more than one plot.
  • We need something where we can easily change the type of plot.
  • Sometimes we need to go further and customize our plot a bit to improve its readability.
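As an illustration of the last point, here is one way the histogram could be customized in base R. This is a sketch, not the only option: the breaks, color, and labels below are assumptions chosen for this example.

```r
## Re-create x exactly as defined earlier
set.seed(20161108)
x <- sample(x = 1:10, size = 1000, replace = TRUE,
    prob = (1:10)/sum(1:10))

## Breaks at the half-integers give each value from 1 to 10 its own bar
## (the default breaks typically put 1 and 2 in the same first bin)
h <- hist(x, breaks = seq(0.5, 10.5, by = 1), col = 'darkorange',
    main = 'A nicer plot', xlab = 'Our x variable', ylab = 'Count')
```

hist() invisibly returns the bin counts, so h$counts can be checked against table(x).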

What about Stata?

use "x.dta", clear
histogram x

graph box x

Can we make it nicer?

graph box x, box(1, fcolor(dkorange)) ///
    ytitle(Our X variable) ///
    title(A nicer plot) subtitle(Made by Leonardo) ///
    caption(This plot is closer to being finished)

At least three types of plots

  • Quick and dirty: for you or a meeting with close collaborators.
  • For a presentation: make sure people can read it!
  • For a journal

Graphics in Stata

  • Stata 14.1 for Mac

Graphics in R

  • Quick and dirty: either with R-base or ggplot2
  • For a presentation: similar, just a few more options
  • For a journal: either works, but you might need to tweak more details.
  • Other options: https://cran.r-project.org/view=Graphics

Why ggplot2?

library('ggplot2')

## Some example data
head(diamonds)
## # A tibble: 6 × 10
##   carat       cut color clarity depth table price     x     y     z
##   <dbl>     <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23     Ideal     E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium     E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good     E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium     I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good     J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336  3.94  3.96  2.48

Nicer defaults

qplot(x = carat, y = price, data = diamonds)

Can easily involve more than two variables

qplot(x = carat, y = price, color = cut, data = diamonds)

Set colors based on the third variable

qplot(x = carat, y = price, color = cut, data = diamonds) +
    scale_color_brewer(palette = 'PuOr')

Separate by a fourth one

qplot(x = carat, y = price, color = cut, data = diamonds) +
    scale_color_brewer(palette = 'PuOr') +
    facet_grid(. ~ color, labeller = label_both)

Change theme

qplot(x = carat, y = price, color = cut, data = diamonds) +
    scale_color_brewer(palette = 'PuOr') +
    facet_grid(. ~ color) + theme_bw(base_size = 20)
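Same plot with ggplot()

qplot() is a shortcut. The sketch below spells out the same layers with ggplot() and geom_point(), which is the form used in the paper example. As far as I know, newer ggplot2 releases (3.4.0 onwards) deprecate qplot(), so this form may also be the safer one to learn.

```r
library('ggplot2')

## Data and aesthetics first, then one layer or option per line
p <- ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
    geom_point() +
    scale_color_brewer(palette = 'PuOr') +
    facet_grid(. ~ color, labeller = label_both) +
    theme_bw(base_size = 20)
p
```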

Treatment Episode Data

qplot(CASEID, DAYWAIT, data = teds2014)

qplot(CASEID, DAYWAIT, color = GENDER, data = teds2014)

qplot(CASEID, DAYWAIT, color = GENDER, data = teds2014) +
    facet_grid(. ~ GENDER)

library('productplots')
prodplot(teds2014, ~ GENDER)

Paper

prodplot(teds2014, ~ MARSTAT + GENDER)

Actual code:

g1 <- ggplot(data = subset(emp_exons_one_cuts,
        Aligner == 'HISAT'),
    aes(x = FDR, y = Power, shape = StatMethod,
        color = cluster)) +
    geom_point(size = 3) + geom_line() +
    ylab('Empirical power') +
    xlab('Observed FDR (in percent)') +
    theme_linedraw(base_size = 16) +
    scale_color_brewer(palette = 'Set1', name = 'Group') +
    scale_shape_discrete(name = 'Statistical\nmethod')
g1

Source

ggplot2 takeaways

  • Good defaults for quick and dirty
  • Easy to quickly change code and visualize other variables
  • Has an ecosystem that allows making more plots and makes it easy to change things (colors, themes, etc.)
  • Can be used for all types of plots: quick, presentation, journal-quality
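A rough sketch of how one ggplot2 object could serve all three types of plots. The base sizes, file name, and figure dimensions are placeholders I picked for illustration; a real journal will have its own requirements.

```r
library('ggplot2')

p <- ggplot(data = diamonds, aes(x = carat, y = price, color = cut)) +
    geom_point()

## Quick and dirty: just print it
p

## Presentation: bigger text so people can read it
p_talk <- p + theme_bw(base_size = 20)

## Journal: fixed size and vector output (placeholder name and dimensions)
out_file <- file.path(tempdir(), 'diamonds_journal.pdf')
ggsave(out_file, plot = p + theme_linedraw(base_size = 10),
    width = 7, height = 5, units = 'in')
```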

Interactive plots

  • Sometimes we want to play around with our data more
  • Interactive plots go from very simple to super specific
  • How can I visualize my data?
  • shinycsv

Demo time

CASEID vs NUMSUBS

graph twoway scatter CASEID NUMSUBS

graph box CASEID, by(NUMSUBS)

Exercise

  • Use the prepared teds2014 subset data: R, Stata, shinycsv, …
  • Does the relationship between education level and gender change by race?

Solution

suppressMessages( library('shinycsv') )
plot_twoway(teds2014$EDUC, teds2014$GENDER, 'educ', 'gender')

df <- subset(teds2014, RACE == 'WHITE')
plot_twoway(df$EDUC, df$GENDER, 'educ', 'gender')

df2 <- subset(teds2014, RACE == 'ASIAN')
plot_twoway(df2$EDUC, df2$GENDER, 'educ', 'gender')
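Another option: one panel per race with ggplot2, so the groups can be compared side by side. The sketch below uses a simulated stand-in data frame (made-up values and made-up EDUC and GENDER labels, not the real TEDS data) so that it runs on its own; with the real subset you would pass teds2014 instead of toy.

```r
library('ggplot2')

## Simulated stand-in for teds2014: values and category labels are made up
set.seed(20161109)
toy <- data.frame(
    EDUC = sample(c('level A', 'level B', 'level C'), 300, replace = TRUE),
    GENDER = sample(c('MALE', 'FEMALE'), 300, replace = TRUE),
    RACE = sample(c('WHITE', 'ASIAN'), 300, replace = TRUE)
)

## Proportion of each education level within gender, one panel per race
p <- ggplot(data = toy, aes(x = GENDER, fill = EDUC)) +
    geom_bar(position = 'fill') +
    facet_grid(. ~ RACE, labeller = label_both) +
    ylab('Proportion')
p
```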

How I made the example data

## Download and unzip http://wwwdasis.samhsa.gov/dasis2/teds_pubs/2014/Admissions/teds_a_2014_r.zip
load('teds_a_2014.rda')
set.seed(20161109)
teds2014 <- teds_a_2014[sample(seq_len(nrow(teds_a_2014)), 1e4), ]
save(teds2014, file = 'teds2014.Rdata')
rio::export(teds2014, file = 'teds2014.dta')

More information

Reproducibility info

R.version.string
## [1] "R version 3.3.1 Patched (2016-10-18 r71535)"
packageVersion('ggplot2')
## [1] '2.1.0'
packageVersion('productplots')
## [1] '0.1.1'
packageVersion('shinycsv')
## [1] '0.99.7'